
[Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc#7777

Open
ShaneGZhu wants to merge 7 commits into PaddlePaddle:develop from ShaneGZhu:get_moe_score

Conversation

@ShaneGZhu (Contributor) commented May 11, 2026

Motivation

Kernel fusion: cast + sigmoid + bias + noauxtc. Currently, this is supported only on CUDA devices.

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

  • Added custom_ops/gpu_ops/grouped_topk_kernels.cu: implements grouped_topk_fused_kernel, which performs the cast, sigmoid, bias addition, and grouped top-k routing in a single kernel launch; supports float32/bfloat16/float16 inputs
  • custom_ops/gpu_ops/cpp_extensions.cc: added the grouped_topk function declaration and its pybind11 binding
  • custom_ops/setup_ops.py: added the new .cu file to both compilation source lists
  • fastdeploy/model_executor/layers/moe/moe.py: get_moe_scores takes the new grouped_topk path when use_fused=True, replacing the previous fused_cast_sigmoid_bias + noaux_tc two-kernel call
  • tests/operators/test_grouped_topk_op.py: added correctness and numerical-alignment tests covering four model configurations: DeepSeek-V3, GLM-4.5-Air, Qwen3-30B-A3B, and Kimi-K2
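To make the fused semantics concrete, here is a plain NumPy reference of what the four fused steps compute per token (a sketch only; `grouped_topk_reference` and its parameters are illustrative names, not the PR's API, and the group-scoring rule shown — sum of each group's top-2 experts, as in DeepSeek-V3-style noaux_tc routing — is an assumption about this kernel):

```python
import numpy as np

def grouped_topk_reference(logits, bias, n_group, topk_group, topk):
    """Reference semantics for cast + sigmoid + bias + grouped top-k.

    logits: [n_tokens, n_experts] gating output (any float dtype)
    bias:   [n_experts] e_score correction bias (float32)
    Returns (topk_weights, topk_ids) per token.
    """
    # step 1+2: cast to float32, then sigmoid
    scores = 1.0 / (1.0 + np.exp(-logits.astype(np.float32)))
    # step 3: add the correction bias (used only for expert selection)
    biased = scores + bias.astype(np.float32)
    n_tokens, n_experts = biased.shape
    group_size = n_experts // n_group
    grouped = biased.reshape(n_tokens, n_group, group_size)
    # step 4a: score each group by the sum of its top-2 experts
    group_scores = np.sort(grouped, axis=-1)[:, :, -2:].sum(axis=-1)
    # step 4b: keep only the best `topk_group` groups, mask out the rest
    keep = np.argsort(-group_scores, axis=-1)[:, :topk_group]
    mask = np.zeros((n_tokens, n_group), dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    masked = np.where(mask[:, :, None], grouped, -np.inf).reshape(n_tokens, n_experts)
    # step 4c: top-k experts among the surviving groups
    topk_ids = np.argsort(-masked, axis=-1)[:, :topk]
    # routing weights come from the un-biased sigmoid scores
    topk_weights = np.take_along_axis(scores, topk_ids, axis=-1)
    return topk_weights, topk_ids
```

The point of the fusion is that the kernel produces `topk_weights`/`topk_ids` directly from the raw gating logits, so the intermediate `scores` and `biased` tensors never round-trip through global memory.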

Usage or Command

N/A

Accuracy Tests

| Branch | Parallelism | Model | Comparison | Requests | Engine max BS | Avg input | Avg output | TPS | OTPS | QPS | TTFT (ms) | Decode speed (tok/s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| develop | TP8 | GLM4.5-Air | baseline | 256 | 256 | 159.53 | 5433.42 | 3263.46 | 3170.38 | 0.583 | 1257.81 | 30.38 |
| develop | TP8 | GLM4.5-Air | fused_cast | 256 | 256 | 159.53 | 5662.12 | 3330.47 (+2.06%) | 3239.20 | 0.572 | 1282.50 | 30.70 (+1%) |
| develop | TP8 | GLM4.5-Air | fused_cast_get_moe_score | 256 | 256 | 159.53 | 5604.22 | 3392.78 (+3.95%) | 3298.88 | 0.589 (+1%) | 1458.27 | 30.54 (+0.5%) |

fused_cast+noaux (A) vs fused_cast_grouped_topk (C) performance comparison

| config | T (tokens) | E (experts) | path_a (µs) | path_c (µs) | speedup (a/c) | max idx diff |
|---|---|---|---|---|---|---|
| deepseek_v3 | 1 | 256 | 24.23 | 10.18 | 2.38x | 0.00e+00 |
| deepseek_v3 | 8 | 256 | 24.55 | 11.66 | 2.11x | 0.00e+00 |
| deepseek_v3 | 32 | 256 | 24.28 | 11.85 | 2.05x | 0.00e+00 |
| deepseek_v3 | 128 | 256 | 24.37 | 11.87 | 2.05x | 0.00e+00 |
| deepseek_v3 | 256 | 256 | 24.06 | 12.02 | 2.00x | 0.00e+00 |
| deepseek_v3 | 512 | 256 | 23.91 | 12.27 | 1.95x | 0.00e+00 |
| deepseek_v3 | 1024 | 256 | 24.16 | 20.73 | 1.17x | 0.00e+00 |
| deepseek_v3 | 2048 | 256 | 26.77 | 31.05 | 0.86x | 0.00e+00 |
| deepseek_v3 | 4096 | 256 | 35.95 | 48.19 | 0.75x | 0.00e+00 |
| deepseek_v3 | 8192 | 256 | 60.40 | 77.83 | 0.78x | 0.00e+00 |
| glm45_air | 1 | 128 | 24.08 | 9.67 | 2.49x | 0.00e+00 |
| glm45_air | 8 | 128 | 23.89 | 9.79 | 2.44x | 0.00e+00 |
| glm45_air | 32 | 128 | 31.09 | 11.43 | 2.72x | 0.00e+00 |
| glm45_air | 128 | 128 | 24.34 | 11.45 | 2.13x | 0.00e+00 |
| glm45_air | 256 | 128 | 24.58 | 11.45 | 2.15x | 0.00e+00 |
| glm45_air | 512 | 128 | 24.54 | 11.56 | 2.12x | 0.00e+00 |
| glm45_air | 1024 | 128 | 24.55 | 11.94 | 2.06x | 0.00e+00 |
| glm45_air | 2048 | 128 | 24.54 | 13.05 | 1.88x | 0.00e+00 |
| glm45_air | 4096 | 128 | 26.33 | 17.21 | 1.53x | 0.00e+00 |
| glm45_air | 8192 | 128 | 40.93 | 29.50 | 1.39x | 0.00e+00 |
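Microbenchmark numbers like these are usually gathered with a warmup-then-repeat loop that reports a robust statistic such as the median. A generic sketch of that harness shape (pure Python/NumPy for illustration; the PR's actual harness times CUDA kernels, which would additionally need device synchronization around the timer):

```python
import time
import numpy as np

def bench(fn, warmup=10, iters=100):
    """Median wall-clock time of fn() over `iters` runs, after warmup.

    Warmup absorbs one-time costs (JIT, caches); the median resists
    outliers from OS scheduling noise better than the mean.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    samples.sort()
    return samples[len(samples) // 2]

# Illustrative comparison of two "paths" computing the same result:
x = np.random.default_rng(0).standard_normal(1 << 15).astype(np.float32)
t_a = bench(lambda: float(np.sum(x)))      # vectorized path
t_b = bench(lambda: sum(x.tolist()))       # scalar fallback path
print(f"speedup (b/a): {t_b / t_a:.1f}x")
```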

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. If no unit tests are added, please state the reason in this PR.
  • Provide accuracy results.
  • If the current PR targets the release branch, make sure it has already been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot paddle-bot (Bot) commented May 11, 2026

Thanks for your contribution!

PaddlePaddle-bot (comment marked as outdated)

@PaddlePaddle-bot PaddlePaddle-bot commented May 11, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-13 18:19:37

The CI report is generated from the code below (refreshed every 30 minutes):


1 Task overview

⚠️ Currently 1 Required task is failing (the Approval check has not passed) and another 7 Required tasks are running; please resolve the approval issue, then wait for the remaining tasks to finish.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 38 (0) | 38 | 26 | 1 | 9 | 2 | 0 |

2 Task status summary

2.1 Required tasks: 2/10 passed

Required tasks block merging; failures must be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| ❌ | Approval | 8s | PR issue: missing custom-op approvals from an FD RD and a PaddlePaddle RD | Contact an FD RD / PaddlePaddle RD for review approval | Job | - |
| ⏳ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | - | Running | - | Job | - |
| ⏳ | Run Base Tests / base_tests | - | Running | - | Job | - |
| ⏳ | Run Stable Tests / stable_tests | - | Running | - | Job | - |
| ⏳ | Run Four Cards Tests / run_4_cards_tests | - | Running | - | Job | - |
| ⏳ | Extracted partial CE model tasks / run_ce_cases | - | Running | - | Job | - |
| ⏳ | xpu_4cards_case_test / run_xpu_4cards_cases | - | Running | - | Job | - |
| ⏳ | xpu_8cards_case_test / run_xpu_8cards_cases | - | Running | - | Job | - |

The remaining 2 required tasks passed (Pre Commit, run_tests_logprob).

2.2 Optional tasks: 24/28 passed

Optional tasks do not block merging; their failures are informational only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| ⏳ | xpu_unit_test / run_xpu_unit_test | - | Job | - |
| ⏳ | Trigger Jenkins for PR | - | Job | - |
| ⏸️ | Run iluvatar Tests / run_iluvatar_cases | - | - | - |
| ⏸️ | CI_HPU | - | - | - |

The remaining 24 optional tasks passed.

3 Failure details (Required only)

Approval — process/approval issue (confidence: high)

Approval

  • Status: ❌ failed
  • Error type: process/approval issue
  • Confidence: high
  • Root-cause summary: the PR adds a custom op but is missing one review approval each from a FastDeploy RD and a PaddlePaddle RD
  • Analyzer: generic analysis (fallback)

Root-cause details:
The check_approval.sh script detected that the PR adds a custom op, which requires:

  1. A review approval from one FastDeploy RD (qingqing01/Jiang-Jia-Jun/heavengate)
  2. A review approval from one PaddlePaddle RD (jeff41404/yongqiangma)

Neither requirement is currently met; the script reports "There are 2 approved errors." and exits with code 6.

Key log:

0. You must have one FastDeploy RD (qingqing01, Jiang-Jia-Jun, heavengate) approval for adding custom op.
1. You must have one PaddlePaddle RD (jeff41404, yongqiangma) approval for adding custom op.

There are 2 approved errors.
##[error]Process completed with exit code 6.

Suggested fix:

  1. Contact any one of the following FastDeploy RDs for review approval: @dangqingqing / @jiangjiajun / @DENGKAIPENG
  2. Contact any one of the following PaddlePaddle RDs for review approval: @gaoxiang / @mayongqiang

Fix summary: obtain one review approval each from an FD RD and a PaddlePaddle RD

Link: view log

PaddlePaddle-bot (comment marked as outdated)

@ShaneGZhu ShaneGZhu marked this pull request as ready for review May 11, 2026 11:53
PaddlePaddle-bot (comment marked as outdated)

gongshaotian (Collaborator) previously approved these changes May 11, 2026

@gongshaotian left a comment:

LGTM

from fastdeploy.model_executor.layers.moe.fused_cast_sigmoid_bias import (
fused_cast_sigmoid_bias,
)
pass
A Collaborator replied on the snippet above:

+1

@ShaneGZhu ShaneGZhu changed the title [Ops][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc [Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc May 11, 2026
@codecov-commenter codecov-commenter commented May 11, 2026

Codecov Report

❌ Patch coverage is 80.00000% with 5 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@589a721). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...l_executor/layers/moe/fused_moe_cutlass_backend.py | 66.66% | 1 Missing and 1 partial ⚠️ |
| fastdeploy/model_executor/layers/moe/moe.py | 75.00% | 1 Missing and 1 partial ⚠️ |
| ...el_executor/layers/moe/fused_moe_triton_backend.py | 80.00% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7777   +/-   ##
==========================================
  Coverage           ?   63.15%           
==========================================
  Files              ?      461           
  Lines              ?    64138           
  Branches           ?     9824           
==========================================
  Hits               ?    40505           
  Misses             ?    20851           
  Partials           ?     2782           
| Flag | Coverage Δ |
|---|---|
| GPU | 72.27% <80.00%> (?) |
| XPU | 7.13% <8.00%> (?) |

Flags with carried forward coverage won't be shown.


PaddlePaddle-bot (comment marked as outdated)

gongshaotian (Collaborator) previously approved these changes May 12, 2026

@gongshaotian left a comment:

LGTM

yongqiangma (Collaborator) previously approved these changes May 12, 2026

@yongqiangma left a comment:

LGTM

…ontrol whether to use the kernel-fused path.
PaddlePaddle-bot (comment marked as outdated)

@PaddlePaddle-bot PaddlePaddle-bot left a comment:

🤖 Paddle-CI-Agent | pr_review | 2026-05-13 18:36:14

📋 Review summary

PR overview: adds a grouped_topk fused CUDA kernel that merges the four steps cast + sigmoid + bias + noaux_tc into a single kernel launch, replacing the previous fused_cast_sigmoid_bias + noaux_tc two-kernel path; the new flag enable_moe_scores_elementwise_fuse controls whether it is enabled.

Scope of change: custom_ops/gpu_ops/ (new CUDA kernel), fastdeploy/model_executor/layers/moe/ (three backends + moe.py), fastdeploy/engine/args_utils.py, fastdeploy/scheduler/config.py

Impact tags: [OP] [Optimization]


📝 PR convention check

The title [Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc has three issues: ① it contains two tags (the convention requires exactly one); ② the casing of [Op] does not match the official list ([OP]); ③ there is no space between the tag and the description.

Suggested title (copy-paste ready):

  • [OP] Kernel fusion: cast+sigmoid+bias+noauxtc

Issues

| Level | File | Summary |
|---|---|---|
| 📝 Convention | PR title | two tags, wrong [Op] casing, missing space after the tag |
| ❓ Question | fastdeploy/engine/args_utils.py:344 | the False default removes the existing fused_cast_sigmoid_bias optimization; existing deployments regress |
| 🟡 Suggestion | fastdeploy/model_executor/layers/moe/moe.py:135 | with use_fused_cast=True on the redundant-EP path, gating_output is passed to noaux_tc_redundant without being cast to float32 |
| ❓ Question | fastdeploy/model_executor/layers/moe/fused_moe_cutlass_backend.py:363 | the FD_ENABLE_RL guard was removed; behavior changes when the fuse flag is enabled in RL mode |

Overall assessment

The CUDA kernel implementation (BitonicSort / WarpSelect / Phase 1 & 2) is logically complete, all three backends are updated in step, and the tests cover four representative model configurations; overall quality is good. The main concerns are the default-value policy and type safety on boundary paths; recommend confirming these before merging.
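As background on the BitonicSort stage mentioned above (this is not the PR's code): a bitonic sorting network is attractive inside a warp because its compare-exchange schedule is data-independent, so every lane follows the same control flow with no divergence. A minimal CPU sketch of the network:

```python
def bitonic_sort(vals, ascending=True):
    """Sort a power-of-two-length sequence with a bitonic network.

    The schedule of (i, i^j) compare-exchanges below is fixed in
    advance, independent of the data; on a GPU each pair maps to a
    shuffle between lanes, which is why warp-level top-k kernels
    favor this pattern.
    """
    n = len(vals)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    out = list(vals)
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # compare-exchange stride within a merge
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    up = (i & k) == 0   # direction of this subsequence
                    if (out[i] > out[partner]) == (up == ascending):
                        out[i], out[partner] = out[partner], out[i]
            j //= 2
        k *= 2
    return out
```

A top-k selection then just reads the last k (or first k) slots of the sorted sequence; real kernels prune the network to keep only k candidates per lane group.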

Chunk size of moe input.
"""

enable_moe_scores_elementwise_fuse: bool = False

❓ Question: the False default for enable_moe_scores_elementwise_fuse removes the previously default-on fused_cast_sigmoid_bias optimization.

Before this PR, non-RL CUDA deployments defaulted to fused_cast_sigmoid_bias + noaux_tc (roughly +2% TPS); after merging, the default falls back to plain Python sigmoid + noaux_tc (equivalent to the baseline).

Existing deployments will silently lose the previous performance gain without touching their configuration. Suggest documenting the migration in the flag's help text or the release notes, or evaluating enabling the flag by default on CUDA platforms.

renormalize,
routed_scaling_factor,
)
else:

🟡 Suggestion: when use_fused_cast=True and expert_id_to_ep_rank_array is not None (the redundant-EP path), the callers (cutlass/triton backends) do not pre-cast gate_out to float32 (the if not use_fused: gate_out = gate_out.cast("float32") step is skipped), yet this else branch applies sigmoid directly to a bfloat16/float16 gating_output and adds the float32 e_score_correction_bias, which may trigger an implicit cast or a dtype mismatch inside noaux_tc_redundant.

Suggest adding a float32 cast fallback at the entry of this else branch:

else:
    if gating_output.dtype != paddle.float32:
        gating_output = gating_output.cast("float32")
    scores = paddle.nn.functional.sigmoid(gating_output)
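The concern can be illustrated outside Paddle with NumPy (an illustrative sketch only, none of these names are from the PR): a sigmoid evaluated in half precision loses accuracy before the float32 bias is added, even though the sum is silently promoted to float32 afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
logits16 = rng.standard_normal(1024).astype(np.float16)
bias32 = rng.standard_normal(1024).astype(np.float32)

# sigmoid computed entirely in float16, then added to a float32 bias:
# the result is promoted to float32, but precision was already lost
# in the float16 intermediates.
half_path = (1.0 / (1.0 + np.exp(-logits16))) + bias32

# cast-first path, as the review suggests: sigmoid in float32.
full_path = (1.0 / (1.0 + np.exp(-logits16.astype(np.float32)))) + bias32
```

The two paths differ by up to the float16 rounding error of the sigmoid intermediates, which is exactly the kind of drift that breaks bitwise numerical-alignment tests between the fused and unfused routes.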

if fastdeploy.envs.FD_USE_PHI_MOE_PERMUTE and self.moe_quant_type == "w16a16":
if layer.topk_method == "noaux_tc":
use_fused = not fastdeploy.envs.FD_ENABLE_RL and current_platform.is_cuda() and not fc1_latent_proj
use_fused = (

❓ Question: the previous not fastdeploy.envs.FD_ENABLE_RL guard has been removed. If a user enables enable_moe_scores_elementwise_fuse=True while running in RL training mode (FD_ENABLE_RL=True), the fused kernel will now run in the RL scenario.

Please confirm: ① was the original RL guard intentional (i.e., the fused kernel has correctness or compatibility issues in RL mode), or a historical constraint that can now be dropped? ② If RL is incompatible with the fused path, restore the guard here or document the limitation.
